This article proposes a framework for depth map reconstruction from stereo
images. The depth map provides important information that is commonly used in
applications such as autonomous vehicle navigation, drone navigation and 3D
surface reconstruction. To produce an accurate depth map, the framework must
be robust against challenging regions of low texture, plain color and
repetitive pattern in the input stereo images. Constructing the map requires
several stages: matching cost calculation, cost aggregation, optimization and
refinement. This work therefore develops a framework that uses the sum of
absolute differences (SAD) and a combination of two edge-preserving filters
to increase robustness against the challenging regions. The SAD is computed
with a block matching technique to improve the matching process in low
texture and plain color regions, while the two edge-preserving filters
increase the accuracy in repetitive pattern regions. Results on the
Middlebury standard dataset show that the proposed method is accurate and
able to handle the challenging regions. The framework is also efficient and
can be applied to 3D surface reconstruction. Moreover, this work is
competitive with previously published methods.


This is an open access article under the CC BY-SA license. 


Corresponding Author: 
Rostam Affendi Hamzah 


Fakulti Teknologi Kejuruteraan Elektrik and Elektronik, Universiti Teknikal Malaysia Melaka (UTeM) 
Jalan Hang Tuah Jaya, 76100 Durian Tunggal, Melaka, Malaysia 


Email: rostamaffendi@utem.edu.my 


1. INTRODUCTION 


A depth map contains important information for many applications such as range estimation, size
measurement and 3D surface reconstruction. This article addresses depth map estimation from stereo
images, which is part of the stereo vision field of study. Stereo vision is one of the most important
fields in computer vision and provides algorithms used across many image processing applications.
The stereo images go through a stereo matching process to obtain the depth map. In the literature,
for example Bhuiyan and Khalifa [1], the depth map is also called a disparity map. The process uses
two stereo images: the scene depth is obtained from two viewpoints separated by a known baseline.
Stereo matching yields the correspondence between pixels of the left image and the right image. The
depth map is then determined from the pixel intensity values of the disparity map output by the
stereo matching process [2], [3]. Computing an exact depth map by stereo matching is very
challenging. According to Hamzah et al. [4], the challenges are due to repetitive pattern, low
texture and plain color regions. These regions are hard to match and remain a challenge for
researchers seeking accurate results.

Stereo matching is one of the most interesting tasks in computer vision research. The process
begins by matching corresponding points between two images (i.e., the left and right input images).
The framework of the matching process was proposed by Vedamurthy [5]. It has four stages: matching
cost computation, cost aggregation, disparity computation and, finally, disparity map refinement.
Based on [5], there are three major methods for developing a stereo matching framework. The first
is the local method, which uses all four stages and performs optimization with the
winner-takes-all (WTA) strategy. In contrast, the global method uses an energy minimization
approach based on the Markov random field (MRF) technique and uses three stages, excluding the
aggregation stage. The third is the semi-global method, which combines the local and global
methods and employs all four stages [6].

Commonly, the depth map of a global method is calculated by minimizing a global energy function.
Graph cuts (GC) [7], belief propagation (BP) [8] and dynamic programming (DP) [9] are algorithms of
the global method. An algorithm to identify depth discontinuities from pairs of stereo images was
implemented in [9]; the approach is able to accelerate the dynamic programming. A stereo matching
method based on segmentation using graph cuts assigns disparity planes to each segment to achieve
the optimal solution [10]. Another suggested approach integrates cost allocation-filtering
approaches with global energy minimization approaches, using a two-stage energy minimization
algorithm based on MRF modeling to improve the stereo matching calculation. This method
successfully solved the problem of stereo matching in occluded areas. While these approaches can
yield precise disparities, they are not easy to implement and may fail on complex scenes. In
addition, learning-based methods are not always reliable because they depend on the training data.
Even though global optimization achieves high accuracy of disparity estimation, its high
computational complexity and long run time limit the use of the global method in real-time
applications. Furthermore, the semi-global method is also implemented using the MRF approach, with
a framework similar to the global method but with the cost aggregation stage added.

Matching cost computation is the first stage and produces the preliminary depth map. This stage
generates high noise because left and right image pixels are matched at every coordinate. Several
methods are available in current research, such as pixel-based matching cost [11], feature-based
matching cost [12] and block matching cost [13]. Each of these techniques has its own advantages
and disadvantages, as stated in [14]. Pixel-based matching is fast but produces more noise than the
other techniques. Feature-based matching creates a sparse depth map that covers only image features
such as edges or boundaries. Block-based matching is able to produce high accuracy if the window
size is suitably selected. The aggregation stage is very important for removing the preliminary
noise left by the matching cost process. In general, this stage is applied in local and semi-global
methods. The aggregation process filters the cost volume by summing or averaging it over a support
window. Some available algorithms use segmentation to improve the accuracy. In Žbontar and
LeCun [15], adaptive support windows are applied to increase the accuracy by adjusting to the
intensities of neighboring pixels. Hirschmüller et al. [16] proposed a cost aggregation approach
based on the segment-tree for non-local stereo matching, leading to improvements in both the
precision of disparities and the processing speed. In [17], [18], a cross-scale design was
suggested to enhance cost averaging for effective stereo matching. Recursive edge-aware filters
(REAF) [19] were also proposed for precise and efficient stereo matching.

The third stage is disparity optimization, which normalizes the disparity value and converts it to
the intensity of a depth pixel on the map. Local methods implement this stage with the approach
used in [20]. Global and semi-global methods, as implemented in [21], skip this stage because they
determine the disparity through energy minimization, such as the Markov random field (MRF)
technique. The final stage of the stereo matching framework is depth map refinement, which removes
the remaining noise on the depth map as implemented in [22]. It consists of two sequential
processes: invalid pixel detection, followed by final depth map improvement or filtering. In
Einecke and Eggert [23], a segment-based approach was used, with a plane fitted to the initial
depth pixels in each segment. It assumes that depth varies smoothly and continuously within each
homogeneous color segment, which improves the accuracy. The rest of this paper is organized as
follows. Section 2 explains the methodology of the proposed work, and Section 3 presents the
results and discussion. Section 4 concludes the article.


2. RESEARCH METHODOLOGY

The proposed framework is shown in Figure 1. Stage 1: the matching process begins with the sum of
absolute differences (SAD) cost function to obtain the preliminary disparity map. This function
uses a 17x17 window with the block matching technique. Proper selection of the window size
produces more accurate results and improves the efficiency of the matching process. Stage 2: the
preliminary disparity map is filtered to remove noise and invalid pixels. The guided filter
(GF) [24] is used in the framework. The GF is a non-linear, edge-preserving filter that enhances
the results. Hence, using this filter eliminates invalid pixels on the disparity map and increases
the accuracy. Stage 3: this stage performs disparity optimization (disparity selection). The
proposed work uses the winner-takes-all strategy, in which the disparity with the minimum
aggregated cost is selected and normalized. Stage 4: the final stage further improves the accuracy
with two consecutive processes: filling in invalid pixels, followed by final filtering with the
bilateral filter (BF) [25]. The fill-in process replaces each invalid pixel with a neighboring
valid pixel, which reduces the error on the disparity map. The BF is used because it removes the
remaining noise while preserving object edges, which increases the accuracy of the disparity map.


[Figure 1 diagram: Stage 1, sum of absolute differences (SAD); Stage 2, guided filter (GF);
Stage 3, optimization with winner-takes-all (WTA); Stage 4, fill-in invalid pixels and bilateral
filter (BF)]


Figure 1. The stages of the proposed algorithm 


2.1. Matching cost computation 

The SAD is used in the proposed framework, as shown in Figure 1, to produce the preliminary
disparity map. This early stage is very important because it is where corresponding points between
pixels of the left and right images are established. One of the major problems for matching at
this step is finding corresponding points in textureless regions. To minimize mismatches, a block
matching (window-based) technique is applied with the SAD function. The advantage of block
matching is that it reduces the error in textureless regions. Weighting every pixel in the SAD
window equally increases the matching correctness and at the same time reduces mismatches.
Equation (1) defines the SAD function; this article uses red, green, and blue (RGB) images as the
input stereo images.


$$\mathrm{SAD}(x,y,d) = \sum_{(x,y)\in M} \left| I_l^{i}(x,y) - I_r^{i}(x-d,y) \right| \tag{1}$$


where (x, y, d) is the pixel of interest with its disparity value, $I_l$ and $I_r$ are the left and
right images respectively, i indexes the RGB channels of the left and right images, and M is the
SAD support window of size 17x17.
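
The SAD cost in (1) can be sketched in C++ as follows. This is a minimal illustration rather than the article's implementation: the `RgbImage` container, the `sadCost` name and the rule of skipping window pixels that fall outside either image are assumptions, since the article does not describe its data layout or border handling.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical container: interleaved 8-bit RGB image stored row by row.
struct RgbImage {
    int width;
    int height;
    std::vector<unsigned char> data;  // size = width * height * 3

    int at(int x, int y, int channel) const {
        return data[(static_cast<std::size_t>(y) * width + x) * 3 + channel];
    }
};

// SAD cost of matching left pixel (x, y) with right pixel (x - d, y), summed
// over a (2 * radius + 1)^2 window and the three RGB channels.
// Window pixels outside either image are skipped (assumed border handling).
long sadCost(const RgbImage& left, const RgbImage& right,
             int x, int y, int d, int radius) {
    assert(left.width == right.width && left.height == right.height);
    long cost = 0;
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            const int yy = y + dy;
            const int xl = x + dx;  // column in the left image
            const int xr = xl - d;  // shifted column in the right image
            if (yy < 0 || yy >= left.height) continue;
            if (xl < 0 || xl >= left.width || xr < 0 || xr >= right.width) continue;
            for (int c = 0; c < 3; ++c) {
                cost += std::abs(left.at(xl, yy, c) - right.at(xr, yy, c));
            }
        }
    }
    return cost;
}
```

With `radius = 8` the window is 17x17, the size used in this article. Evaluating `sadCost` for every pixel and every candidate disparity gives the cost volume that the later stages operate on.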


2.2. Cost aggregation 

This stage is the most important part, as it removes the noise from the preliminary depth map. It
filters out the noise from the matching process while preserving object edges. Some incorrect and
ambiguous pixels are formed during the matching process, so the filter at this step must be robust
against such errors. The GF is employed because it preserves object boundaries well and at the
same time eliminates noise, particularly in low texture and flat color regions. Equation (2) gives
the GF kernel applied in this paper.


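$$GF_{(p,q)}(I) = \frac{1}{w^{2}} \sum_{c:(p,q)\in w_c} \left( 1 + \frac{(I_p - \mu_c)(I_q - \mu_c)}{\sigma_c^{2} + \varepsilon} \right) \tag{2}$$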


where {c, σ, ε, I, μ, p, w, q} denote {center pixel of w, variance value, constant parameter,
reference image (left input image), mean value, coordinates of (x,y), window support size,
neighboring coordinates}. The GF is used in this paper for efficient noise removal with fast
execution time. The GF improves the precision at object boundaries. The final calculation of this
step is given by (3),


$$CA(x,y,d) = \mathrm{SAD}(x,y,d)\, GF_{(p,q)}(I) \tag{3}$$


where CA is the cost aggregation, SAD(x, y, d) is the output of the first stage and $GF_{(p,q)}(I)$
is the kernel of the GF.
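
One way to realize the aggregation in (2) and (3) is to filter each disparity slice of the SAD cost volume with the guided filter. The sketch below uses the box-filter formulation from the guided filter literature, which is equivalent to applying the kernel weights in (2). It makes several assumptions that are not taken from the article: a single-channel (grayscale) guide instead of the RGB reference image, a naive box mean clipped at the image border, and the hypothetical names `Plane`, `boxMean` and `guidedFilter`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Plane = std::vector<double>;  // width * height values stored row by row

// Mean over a (2 * radius + 1)^2 window, clipped at the image border.
Plane boxMean(const Plane& src, int width, int height, int radius) {
    Plane out(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            double sum = 0.0;
            int count = 0;
            for (int yy = y - radius; yy <= y + radius; ++yy) {
                for (int xx = x - radius; xx <= x + radius; ++xx) {
                    if (yy < 0 || yy >= height || xx < 0 || xx >= width) continue;
                    sum += src[static_cast<std::size_t>(yy) * width + xx];
                    ++count;
                }
            }
            out[static_cast<std::size_t>(y) * width + x] = sum / count;
        }
    }
    return out;
}

// Guided filtering of one cost slice (fixed disparity d) with a grayscale guide.
Plane guidedFilter(const Plane& guide, const Plane& cost,
                   int width, int height, int radius, double eps) {
    assert(guide.size() == cost.size());
    assert(guide.size() == static_cast<std::size_t>(width) * height);
    const std::size_t n = guide.size();
    Plane gg(n), gc(n);
    for (std::size_t i = 0; i < n; ++i) {
        gg[i] = guide[i] * guide[i];
        gc[i] = guide[i] * cost[i];
    }
    const Plane meanG = boxMean(guide, width, height, radius);
    const Plane meanC = boxMean(cost, width, height, radius);
    const Plane meanGG = boxMean(gg, width, height, radius);
    const Plane meanGC = boxMean(gc, width, height, radius);

    // Per-window linear model cost ~ a * guide + b.
    Plane a(n), b(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double var = meanGG[i] - meanG[i] * meanG[i];  // sigma^2 of the window
        const double cov = meanGC[i] - meanG[i] * meanC[i];
        a[i] = cov / (var + eps);
        b[i] = meanC[i] - a[i] * meanG[i];
    }
    const Plane meanA = boxMean(a, width, height, radius);
    const Plane meanB = boxMean(b, width, height, radius);

    Plane out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = meanA[i] * guide[i] + meanB[i];
    return out;
}
```

In flat windows the variance is close to zero, so the filter behaves like a plain average and suppresses noise. Across a strong edge in the guide the slope `a` stays close to one, so the cost is not blurred over the object boundary. A production version would replace the naive `boxMean` with a running-sum or integral-image box filter.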


2.3. Disparity optimization 

This stage optimizes the depth map using the winner-takes-all (WTA) strategy. For each pixel, the
strategy selects the disparity with the minimum aggregated cost from the cost aggregation stage,
and the selected value is normalized as a floating-point number. The WTA is commonly used in local
methods because of its fast execution [17], [25]. The WTA calculation is given by (4).


$$d_{x,y} = \arg\min_{d \in D} CA(x,y,d) \tag{4}$$


where CA(x, y, d) is the aggregated cost from the second stage and D represents the set of valid
disparity values for an image. After this stage the disparity map still contains noise and invalid
pixels, so it needs to be refined to obtain the best results.
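
The WTA selection in (4) is a per-pixel argmin over the aggregated cost volume. The following sketch assumes the volume is stored as one plane per disparity (`cost[d][pixel]`) and that ties go to the smaller disparity; the `toGray` helper shows one possible normalization to 8-bit gray levels for display. The names and the storage layout are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Cost volume stored as cost[d][y * width + x] for d = 0 .. maxDisparity.
using CostVolume = std::vector<std::vector<double>>;

// Winner-takes-all: for each pixel, keep the disparity with the lowest cost.
// Ties are resolved in favor of the smaller disparity.
std::vector<int> winnerTakesAll(const CostVolume& cost) {
    assert(!cost.empty());
    const std::size_t pixels = cost[0].size();
    std::vector<int> disparity(pixels, 0);
    for (std::size_t i = 0; i < pixels; ++i) {
        double best = cost[0][i];
        for (std::size_t d = 1; d < cost.size(); ++d) {
            assert(cost[d].size() == pixels);
            if (cost[d][i] < best) {
                best = cost[d][i];
                disparity[i] = static_cast<int>(d);
            }
        }
    }
    return disparity;
}

// Scale disparities in [0, maxDisparity] to gray levels in [0, 255].
std::vector<unsigned char> toGray(const std::vector<int>& disparity, int maxDisparity) {
    assert(maxDisparity > 0);
    std::vector<unsigned char> gray(disparity.size());
    for (std::size_t i = 0; i < disparity.size(); ++i) {
        gray[i] = static_cast<unsigned char>(disparity[i] * 255 / maxDisparity);
    }
    return gray;
}
```

Because each pixel is decided independently, this step is cheap compared with the energy minimization used by global methods, which is the reason local methods favor it.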


2.4. Refinement stage 

This stage is the last part of the proposed algorithm and consists of two consecutive
post-processes. It starts with hole filling, that is, replacing invalid pixels with valid pixel
values. This article uses the nearest valid pixel value (i.e., a close neighbor) to fill in each
hole. After this process, some unwanted pixels or artifacts emerge, so the depth map needs to be
smoothed to filter them out. The bilateral filter (BF) is used for this purpose. This filter is
very robust in low texture and repetitive pattern regions because it removes noise while
preserving object edges. The BF function is given by (5) as follows:


$$BF_{(p,q)} = \sum_{q \in w_B} \exp\left( -\frac{|p - q|^{2}}{\sigma_s^{2}} \right) \exp\left( -\frac{|I_p - I_q|^{2}}{\sigma_c^{2}} \right) \tag{5}$$
where $w_B$ is the BF window, q is a neighboring pixel and p is the location of the pixel of
interest at (x, y). The term p-q represents the spatial Euclidean distance and $I_p - I_q$ is the
Euclidean distance in color space. The factor $\sigma_s$ controls the spatial adjustment and
$\sigma_c$ is the similarity factor for color. The final depth map is therefore given by (5), where
$BF_{(p,q)}$ is the depth output d at the pixel of interest p.
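
The two refinement steps can be sketched as below. Several details are assumptions rather than the article's stated method: invalid pixels are marked with a hypothetical `kInvalid` value, the nearest valid neighbor is searched along the same row only, the range weight uses a grayscale intensity image, and the bilateral weights are normalized by their sum (the kernel in (5) is written without a normalization term).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

const int kInvalid = -1;  // hypothetical marker for invalid disparities

// Replace each invalid pixel with the nearest valid disparity on the same row.
// If both sides are equally close, the left neighbor is used.
void fillInvalid(std::vector<int>& disp, int width, int height) {
    assert(disp.size() == static_cast<std::size_t>(width) * height);
    const std::vector<int> src = disp;  // read from an unmodified copy
    for (int y = 0; y < height; ++y) {
        const std::size_t row = static_cast<std::size_t>(y) * width;
        for (int x = 0; x < width; ++x) {
            if (src[row + x] != kInvalid) continue;
            for (int step = 1; step < width; ++step) {
                const int xl = x - step;
                const int xr = x + step;
                if (xl >= 0 && src[row + xl] != kInvalid) { disp[row + x] = src[row + xl]; break; }
                if (xr < width && src[row + xr] != kInvalid) { disp[row + x] = src[row + xr]; break; }
            }
        }
    }
}

// Bilateral smoothing of the disparity map. Spatial weights depend on |p - q|,
// range weights on the intensity difference between p and q.
std::vector<double> bilateralFilter(const std::vector<int>& disp,
                                    const std::vector<double>& intensity,
                                    int width, int height, int radius,
                                    double sigmaS, double sigmaC) {
    assert(disp.size() == intensity.size());
    assert(disp.size() == static_cast<std::size_t>(width) * height);
    std::vector<double> out(disp.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const std::size_t p = static_cast<std::size_t>(y) * width + x;
            double sum = 0.0;
            double weightSum = 0.0;
            for (int yy = y - radius; yy <= y + radius; ++yy) {
                for (int xx = x - radius; xx <= x + radius; ++xx) {
                    if (yy < 0 || yy >= height || xx < 0 || xx >= width) continue;
                    const std::size_t q = static_cast<std::size_t>(yy) * width + xx;
                    const double dist2 =
                        static_cast<double>((xx - x) * (xx - x) + (yy - y) * (yy - y));
                    const double diff = intensity[p] - intensity[q];
                    const double w = std::exp(-dist2 / (sigmaS * sigmaS)) *
                                     std::exp(-(diff * diff) / (sigmaC * sigmaC));
                    sum += w * disp[q];
                    weightSum += w;
                }
            }
            out[p] = sum / weightSum;  // weightSum >= 1 because q == p has weight 1
        }
    }
    return out;
}
```

Pixels on the far side of an intensity edge receive a range weight close to zero, so the smoothing stays within each object and the edges of the depth map are preserved.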


2.5. 3D Reconstruction 

This section explains an application of the depth map. It demonstrates the accuracy of the depth
map produced in this article through qualitative assessment. From the depth map result in (5), a
library from OpenCV is used for 3D surface reconstruction. It is based on (6):


$$D = \frac{b f}{d} \tag{6}$$


where D represents the depth, b and f are the stereo camera baseline and focal length
respectively, and d denotes the disparity value. The 3D surface reconstruction in this article is
therefore formulated as (7) as follows:


— f 
3D = cae (7) 
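
The depth relation in (6) can be illustrated with a short sketch. The back-projection to a 3D point uses the standard pinhole camera model and a hypothetical principal point (`cx`, `cy`); it is shown only to indicate how a surface can be built from depth values and is not the OpenCV routine used in this article. The units (baseline in meters, focal length and disparity in pixels) are also assumptions.

```cpp
#include <cassert>
#include <limits>

// D = b * f / d, with baseline b in meters, focal length f in pixels and
// disparity d in pixels, giving the depth in meters.
double depthFromDisparity(double disparity, double baseline, double focalLength) {
    assert(baseline > 0.0 && focalLength > 0.0);
    if (disparity <= 0.0) {
        // Zero disparity corresponds to a point at infinity.
        return std::numeric_limits<double>::infinity();
    }
    return baseline * focalLength / disparity;
}

struct Point3 {
    double x;
    double y;
    double z;
};

// Back-project pixel (u, v) with known depth using the pinhole model.
// (cx, cy) is the principal point in pixels (hypothetical calibration values).
Point3 backProject(double u, double v, double depth,
                   double focalLength, double cx, double cy) {
    assert(focalLength > 0.0);
    Point3 point;
    point.x = (u - cx) * depth / focalLength;
    point.y = (v - cy) * depth / focalLength;
    point.z = depth;
    return point;
}
```

For example, with a baseline of 0.5 m, a focal length of 1000 pixels and a disparity of 100 pixels, the depth is 0.5 * 1000 / 100 = 5 m. Larger disparities therefore correspond to objects closer to the camera.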


3. RESULTS AND DISCUSSION 

This section describes the experimental results and analysis of the proposed framework. The
experiments were conducted in C++ on a 10th-generation Intel Core i5 processor with 8 GB of random
access memory (RAM). This article uses the standard benchmark images provided by the Middlebury
stereo evaluation system. The database contains 15 training images with different characteristics
such as varying illumination, low texture, discontinuities and plain color regions. The all and
nonocc errors are the measures used to evaluate the proposed algorithm. The measurement is based
on the weighted average error of the testing and training images, which is provided by the
Middlebury online webpage. Figure 2 shows grayscale depth map results for the challenging regions.
The results show three images with clear matching results on the depth map. The low texture region
of the chair in the Adirondack image is clearly resolved in the depth map. The plain color region
of the recycle bin in the Recycle image is also visible on the depth map. For the repetitive
region, the Jadeplant image is selected because of its repetitive leaves; the depth map shows that
the leaves are well detected at their object sizes. Based on this figure, the proposed work is
robust against the challenging regions. Overall, 15 training images were used for performance
measurement: Recycle, Adirondack, Teddy, Playroom, Jadeplant, MotorcycleE, Pipes, Piano, PianoL,
Playtable, ArtL, PlaytableP, Shelves, Motorcycle and Vintage, as displayed in Figure 3. This figure
shows largely accurate depth map results rendered in RGB color. Red indicates that an object is
closer to the stereo camera, while blue regions are far from the stereo camera. A corresponding
near-to-far mapping is used for the grayscale depth maps.


[Figure 2 layout: columns are Challenging Regions, Left Reference Image and Depth Map Results
(Grayscale); rows are the low texture region of the Adirondack image, the plain color region of
the Recycle image and the repetitive region of the Jadeplant image]


Figure 2. The depth map results for the challenging regions using training images 


[Figure 3 panels: disparity maps of the training images, including Adirondack and MotorcycleE]


Figure 3. The disparity map results from the proposed work using the training images 

Tables 1 and 2 give the quantitative results of the proposed work using the Middlebury standard
stereo evaluation system. The results for the 15 training images are provided by the Middlebury
online system. Table 1 lists the nonocc error, for which the proposed algorithm produces an
average error of 8.83%. It ranks second behind the morphological processing stereoscopic
visualization (MPSV) method with 8.81%, and is more accurate than the accurate dense stereo
matching (ADSM) and binary stereo matching (BSM) methods with 8.95% and 13.40% respectively.
Table 2 presents the all error for the same methods as Table 1, to keep the comparison consistent.
Here the proposed work produces the lowest average error (12.1%), ahead of ADSM (12.3%), MPSV
(12.7%) and BSM (23.5%). This shows that the proposed work is competitive with other available
methods. For several complex images, namely Jadeplant, Piano, PianoL and Playtable, the proposed
work also produces the lowest error among the methods in Table 2.
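
For readers unfamiliar with the two error types, the sketch below shows how a bad-pixel percentage of this kind is commonly computed. It is an illustration under stated assumptions, not the Middlebury evaluation code: the disparity error threshold is left as a parameter because the article does not state which one was used, and the mask is assumed to select non-occluded pixels for the nonocc error or all pixels with ground truth for the all error.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Percentage of evaluated pixels whose absolute disparity error exceeds
// `threshold`. mask[i] is true for pixels that take part in the evaluation.
double badPixelPercent(const std::vector<double>& estimated,
                       const std::vector<double>& groundTruth,
                       const std::vector<bool>& mask,
                       double threshold) {
    assert(estimated.size() == groundTruth.size());
    assert(estimated.size() == mask.size());
    std::size_t evaluated = 0;
    std::size_t bad = 0;
    for (std::size_t i = 0; i < estimated.size(); ++i) {
        if (!mask[i]) continue;
        ++evaluated;
        if (std::fabs(estimated[i] - groundTruth[i]) > threshold) ++bad;
    }
    if (evaluated == 0) return 0.0;
    return 100.0 * static_cast<double>(bad) / static_cast<double>(evaluated);
}
```

Since the all mask includes occluded pixels, where matching is not possible, the all error is normally higher than the nonocc error for the same image, which is consistent with the values in Tables 1 and 2.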

Figure 4 shows the 3D surface reconstruction obtained from a depth map result. The accuracy of the
depth map produced by the proposed framework is crucial, since it determines the quality of the 3D
surface reconstruction. In this figure, the motorcycle is clearly visible in the grayscale depth
map, and the depth contours are accurately presented in both views (i.e., front view and top
view). The motorcycle is closer to the stereo camera and is clearly separated from the background
grayscale tones, which shows that the background objects are far behind the motorcycle. This is
confirmed by the top view of the 3D surface reconstruction, where the depth distances are well
positioned. The depth layers are also visible in this image, which indicates that the matching
process from the first to the final stage of the proposed work performs well. The pixel
intensities are accurately positioned on the depth map of each object, which helps to reconstruct
the 3D surface with accurate depth estimation.


Table 1. Comparison of currently available methods by nonocc error (%) from the Middlebury benchmark
Algorithms   MPSV [17]   Proposed algorithm   ADSM [18]   BSM [25]
Adiron       3.83        7.07                 13.3        7.27
ArtL         6           7.01                 6.1         11.4
Jadepl       19.7        14.6                 15          30.5
Motor        5.85        4.39                 3.67        6.67
MotorE       5.53        5.08                 5.67        6.52
Piano        5.68        6.37                 7.08        10.8
PianoL       34.3        12.9                 20.6        32.1
Pipes        9.59        9.41                 6.57        10.5
Playrm       5.86        14.8                 13.2        12.5
Playt        15.3        9.63                 23.1        24.4
PlayP        4.2         7.59                 3.55        12.8
Recyc        4.59        7.76                 5.76        7.42
Shelvs       13          18                   17.2        16.4
Teddy        3.1         5.01                 3.05        4.88
Vintge       14.3        17                   10.1        32.8
Ave          8.81        8.83                 8.95        13.4


Table 2. Comparison of currently available methods by all error (%) from the Middlebury benchmark
Algorithms   Proposed algorithm   ADSM [18]   MPSV [17]   BSM [25]
Adiron       9.06                 14.3        5.87        12.7
ArtL         10.1                 10.6        9.43        28.7
Jadepl       30.6                 34.1        40.2        58.7
Motor        7.02                 6           9.11        14.8
MotorE       8.23                 8           8.8         14.7
Piano        6.98                 7.37        7.03        16
PianoL       13.4                 20.4        34.2        35.8
Pipes        15.5                 12.1        15.8        24.5
Playrm       19.3                 16.9        8.58        29.4
Playt        11.9                 25.5        16.9        31
PlayP        9.71                 5.84        5.89        20.2
Recyc        8.29                 5.83        6.78        12.1
Shelvs       17.9                 17.2        13.7        19.2
Teddy        5.83                 4.11        4.82        14.3
Vintge       17.3                 11.1        16.8        39.3
Ave          12.1                 12.3        12.7        23.5

[Figure 4 panels: Left Image; Depth Map (grayscale); 3D surface reconstruction]


Figure 4. The depth map result with the 3D surface reconstruction of the motorcycle image 


4. CONCLUSION 

This paper proposes a framework for depth map reconstruction from stereo images. The framework has
four stages that produce a depth map or disparity map: matching cost calculation, cost
aggregation, optimization and refinement. The proposed work uses the sum of absolute differences
(SAD), guided filter (GF), winner-takes-all (WTA) and bilateral filter (BF) for these stages
respectively. Based on experiments on a standard benchmark, the proposed work is able to handle
regions that are difficult to match, such as low texture, plain color and repetitive pattern
regions. The results show that the depth maps are fully reconstructed for the sample standard
images. The performance of the proposed framework is also measured quantitatively. These
measurements show that the proposed work is competitive with currently published works, with
nonocc and all errors of 8.83% and 12.10% respectively. The quantitative measurements are provided
by the Middlebury standard benchmarking evaluation system. Finally, the proposed work can also be
used for 3D surface reconstruction, as demonstrated in this article with the motorcycle image. The
3D reconstruction result shows precise object locations and depth contours. Hence, the proposed
work can be used as a complete algorithm for depth map estimation and is competitive with other
available methods.